Final Project

Author

Ava Foster

Predictive Factors of Age on Purchasing

1. Introduction

There has been much talk lately about how the generations differ in their behavior, especially as Gen-Z is about to enter adulthood. Much of this dialogue takes place online. For example, the catchphrase "ok, boomer" gained significant attention, comparing Millennials and Gen-Z is a common trend, and Generation Alpha has entered its school-age years, which has led to online discourse about its lack of social skills due to technology. These generational differences in behavior have many implications for marketers. This study aims to better understand how age influences purchase behavior.

The Baby Boomer generation is made up of those born between 1945 and 1963, which includes people from age 60 to age 78. The next generation, Generation X, is made up of people born between 1964 and 1978, which includes people from age 45 to age 59. Generation X is followed by Millennials, who were born between 1979 and 1993, which includes people between the ages of 30 and 44. The last generation relevant to this study is Generation Z, born between 1994 and 2011, whose members are now between the ages of 12 and 29.
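The age brackets above can be turned into a generation label with a short helper. This is only a sketch: the function name `age_to_generation` is hypothetical, and the code that actually created the `generation` column in the data set is not shown in this report.

```r
# Hypothetical helper: map an age to the generation brackets described above.
# cut() uses right-closed intervals by default: (11,29], (29,44], (44,59], (59,78].
age_to_generation <- function(age) {
  cut(
    age,
    breaks = c(11, 29, 44, 59, 78),
    labels = c("Generation Z", "Millennials", "Generation X", "Baby Boomers")
  )
}

# Ages at the edges of each bracket.
example_ages <- c(18, 29, 30, 44, 45, 59, 60, 70)
data.frame(age = example_ages, generation = age_to_generation(example_ages))
```

Ages outside 12 to 78 return `NA` under this sketch.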

To better understand how age influences shopping behavior, I used a data set from Kaggle that compiles information on consumers and "includes demographic information, purchase history, product preferences, and preferred shopping channels (online or offline)" (Kaggle). The data was last updated in October 2023. Each row of data represents a different individual consumer.

Here is a snapshot of 5 randomly chosen rows of the data set we'll use:

# A tibble: 5 × 22
  customer_ID   age gender item_purchased category  purchase_amount_USD location
        <dbl> <dbl> <chr>  <chr>          <chr>                   <dbl> <chr>   
1        3633    27 Female Sneakers       Footwear                   73 Utah    
2        3127    57 Female Sunglasses     Accessor…                  76 Maine   
3         999    51 Male   Shoes          Footwear                   90 Connect…
4         417    36 Male   Belt           Accessor…                  55 Oregon  
5        3410    24 Female Shirt          Clothing                   93 Minneso…
# ℹ 15 more variables: size <chr>, color <chr>, season <chr>,
#   review_rating <dbl>, subscription_status <chr>, shipping_type <chr>,
#   discount_applied <chr>, promo_code_used <chr>, previous_purchases <dbl>,
#   payment_method <chr>, forequency_of_purchases <chr>, age_group <fct>,
#   generation <fct>, numeric_age_group <dbl>, numeric_generation <dbl>

2. Exploratory Data Analysis

We had an original sample size of 3,900. None of the participants in the sample had missing values, so our total sample size remained 3,900.

Table 1. Summary Statistics by generation of number of participants, mean and standard deviation for previous_purchases.

# A tibble: 4 × 4
  generation   count mean_previous_purchases sd_previous_purchases
  <fct>        <dbl>                   <dbl>                 <dbl>
1 Baby Boomers   863                    25.9                  14.3
2 Generation X  1118                    25.6                  14.5
3 Generation Z   802                    24.6                  14.3
4 Millenials    1117                    25.2                  14.6

Table 2. Summary Statistics by generation of mean and standard deviation of review rating and mean and standard deviation amount purchased in USD.

# A tibble: 4 × 5
  generation   mean_review sd_review mean_purchase_amount sd_purchase_amount
  <fct>              <dbl>     <dbl>                <dbl>              <dbl>
1 Baby Boomers        3.75     0.710                 59.4               23.9
2 Generation X        3.72     0.721                 60.0               23.8
3 Generation Z        3.79     0.715                 60.4               23.9
4 Millenials          3.75     0.716                 59.4               23.3

Table 3. Summary Statistics by age group of number of participants by age group and mean and standard deviation of previous purchases.

# A tibble: 10 × 4
   age_group count mean_previous_purchases sd_previous_purchases
   <fct>     <dbl>                   <dbl>                 <dbl>
 1 18-24       418                    24.1                  14.1
 2 25-29       384                    25.1                  14.5
 3 30-34       371                    25.0                  14.6
 4 35-39       361                    25.1                  14.8
 5 40-44       385                    25.5                  14.6
 6 45-49       338                    23.6                  14.1
 7 50-54       382                    26.3                  14.6
 8 55-59       398                    26.5                  14.5
 9 60-64       363                    25.2                  14.2
10 65-70       500                    26.4                  14.3

Table 4. Summary Statistics by age group of mean and standard deviation of review rating and mean and standard deviation of amount purchased in USD.

# A tibble: 10 × 5
   age_group mean_review sd_review mean_purchase_amount sd_purchase_amount
   <fct>           <dbl>     <dbl>                <dbl>              <dbl>
 1 18-24            3.81     0.730                 59.7               23.7
 2 25-29            3.77     0.699                 61.0               24.1
 3 30-34            3.76     0.717                 60.6               23.2
 4 35-39            3.73     0.698                 59.5               23.5
 5 40-44            3.76     0.734                 58.2               23.0
 6 45-49            3.71     0.714                 57.1               23.6
 7 50-54            3.72     0.742                 63.1               23.9
 8 55-59            3.72     0.708                 59.4               23.6
 9 60-64            3.73     0.727                 59.4               23.4
10 65-70            3.77     0.698                 59.3               24.3
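Tables 1 to 4 are grouped summaries: one row per group, with a count, a mean, and a standard deviation. The code that produced them is not shown, so the following is only a minimal base R sketch of the same idea on a small made-up data frame (the values in `toy` are invented for illustration and are not the Kaggle data).

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  generation = c("Generation Z", "Generation Z",
                 "Generation X", "Generation X", "Generation X"),
  previous_purchases = c(10, 20, 30, 40, 50)
)

# Split the rows by generation, then compute count, mean and sd per group.
summary_by_generation <- do.call(rbind, lapply(
  split(toy, toy$generation),
  function(g) {
    data.frame(
      generation = g$generation[1],
      count = nrow(g),
      mean_previous_purchases = mean(g$previous_purchases),
      sd_previous_purchases = sd(g$previous_purchases)
    )
  }
))
print(summary_by_generation)
```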

The generation with the greatest mean number of previous purchases was the Baby Boomers (n = 863, mean = 25.93, median = 26, sd = 14.26). When split by age group, however, the group with the greatest mean number of previous purchases was those aged 55 to 59 (n = 398, mean = 26.5, sd = 14.5 in Table 3), who are members of Generation X. The group with the second greatest mean was those aged 65 to 70 (n = 500, mean = 26.42, median = 27, sd = 14.29), who are indeed a part of the Baby Boomer generation.

The generation that spent the greatest average amount of money in USD on their purchase was Generation Z (n = 802, mean = 60.36, median = 61, sd = 23.88). However, the age group that spent the greatest average amount was those aged 50 to 54 (n = 382, mean = 63.06, median = 64, sd = 23.90), who are members of Generation X, followed by the Generation Z age group between the ages of 25 and 29 (n = 384, mean = 61.04, median = 61.5, sd = 24.07).

Finally, the generation with the highest average review rating was Generation Z (n = 802, mean = 3.79, median = 3.8, sd = 0.71), followed by Millennials (n = 1117, mean = 3.75, median = 3.8, sd = 0.72); Baby Boomers show the same mean of 3.75 at the precision reported in Table 2. The age group with the highest average review rating was those aged 18 to 24 (n = 418, mean = 3.81, median = 3.9, sd = 0.73), who are members of Generation Z. They were followed by those aged 25 to 29 (n = 384, mean = 3.77, median = 3.8, sd = 0.70), the other Generation Z age group; those aged 65 to 70 show the same mean of 3.77 in Table 4.

This exploratory analysis leads us to ask whether other variables, in combination with age, influence purchase behavior.

3. Multiple Linear Regression

3.1.1. Model 1 Methods

The components of my multiple linear regression model are the following:

  • Outcome variable y1 = purchase amount in USD

  • Numerical explanatory variable x1 = age

  • Categorical explanatory variable x2 = frequency of purchases

We want to know if the relationship between age and purchase_amount_USD is conditional on one’s frequency of purchase.
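An interaction model of this kind can be fit with `lm()` using the `*` operator in the formula, which includes both main effects and their interaction. The sketch below runs on simulated data, since the fitting code is not shown in this report. The level names "Annually", "Biweekly" and "Quarterly" and all simulated values are assumptions for illustration; only the column names are taken from the data snapshot.

```r
set.seed(1)

# Simulated stand-in for the real data (values are made up).
n <- 200
toy <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  forequency_of_purchases = sample(
    c("Annually", "Biweekly", "Quarterly"), n, replace = TRUE
  )
)
toy$purchase_amount_USD <- 60 + rnorm(n, sd = 20)

# age * frequency expands to: age + frequency + age:frequency.
model_1 <- lm(purchase_amount_USD ~ age * forequency_of_purchases, data = toy)

# Estimates, standard errors, test statistics, p-values and 95% intervals.
round(cbind(coef(summary(model_1)), confint(model_1)), 3)
```

With three frequency levels the model has six coefficients: an intercept and an age slope for the baseline level, plus one intercept offset and one slope offset for each of the other two levels.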

Table 5. Regression table for interaction model of amount purchased in USD as a function of age and frequency of purchases.

# A tibble: 14 × 7
   term                 estimate std.error statistic  p.value conf.low conf.high
   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
 1 (Intercept)          65.7        3.03     21.7    7.30e-99  5.98e+1 71.7     
 2 age                  -0.125      0.0640   -1.95   5.13e- 2 -2.50e-1  0.000680
 3 forequency_of_purch… -8.87       4.29     -2.06   3.91e- 2 -1.73e+1 -0.445   
 4 forequency_of_purch… -4.78       4.22     -1.13   2.57e- 1 -1.31e+1  3.49    
 5 forequency_of_purch… -5.24       4.33     -1.21   2.26e- 1 -1.37e+1  3.24    
 6 forequency_of_purch… -8.31       4.38     -1.90   5.81e- 2 -1.69e+1  0.285   
 7 forequency_of_purch… -8.24       4.30     -1.92   5.53e- 2 -1.67e+1  0.188   
 8 forequency_of_purch… -1.51       4.43     -0.342  7.33e- 1 -1.02e+1  7.18    
 9 age:forequency_of_p…  0.213      0.0923    2.31   2.11e- 2  3.20e-2  0.394   
10 age:forequency_of_p…  0.105      0.0905    1.16   2.48e- 1 -7.29e-2  0.282   
11 age:forequency_of_p…  0.0915     0.0927    0.987  3.24e- 1 -9.03e-2  0.273   
12 age:forequency_of_p…  0.167      0.0933    1.80   7.27e- 2 -1.54e-2  0.350   
13 age:forequency_of_p…  0.180      0.0910    1.98   4.76e- 2  1.94e-3  0.359   
14 age:forequency_of_p…  0.00698    0.0940    0.0742 9.41e- 1 -1.77e-1  0.191   

3.1.2. Model 1 Results

  • Since “annually” comes first alphabetically, people who shop annually are the “baseline comparison group”. Therefore, the intercept (b0 = 65.75) represents the intercept for only the annual group.
  • The estimate for the slope for age (bage = -0.12) is the associated change in purchase amount for every increase of one year in age, again for only the annual group. For every increase of one year in age, there is an associated decrease of 0.12 USD in amount purchased.

  • The estimates for the other purchasing frequencies (rows 3 to 8 of Table 5) are the offsets in intercept relative to the annual group (baseline).

  • The age x frequency estimates (rows 9 to 14 of Table 5) are the offsets in slope for age relative to the annual group.
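To make the offsets concrete, the regression line for a non-baseline group can be recovered by adding its offsets to the baseline values. The sketch below uses the first non-baseline level (rows 3 and 9 of Table 5, which the interpretation below refers to as the biweekly group). The term labels are truncated in the table, so the pairing of rows relies on the assumption that the offsets are listed in the same level order.

```r
# Estimates copied from Table 5, rounded as printed.
b0        <- 65.7     # intercept for the baseline (annual) group
b_age     <- -0.125   # slope for age in the baseline group
offset_b0 <- -8.87    # row 3: intercept offset, first non-baseline level
offset_b1 <- 0.213    # row 9: slope offset for the same level (assumed pairing)

# Group-specific line = baseline value + offset.
intercept_group <- b0 + offset_b0
slope_group     <- b_age + offset_b1

c(intercept = intercept_group, slope = slope_group)
```

Under these rounded estimates the group's intercept is 56.83 and its slope for age is 0.088, so the fitted line for this group rises with age while the baseline line falls.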

3.1.3. Model 1 Interpretation

Using the output of our regression table we'll test two different null hypotheses. The first null hypothesis is that there is no relationship between age and amount purchased in USD at the population level (the population slope is zero).

There appears to be a possible negative relationship between age and amount purchased in USD for consumers in the baseline group (bage = -0.12). However, this does not appear to be a meaningful relationship since in the table we see

  • the 95% confidence interval for the population slope Bage is (-0.250, 0.00068), and zero is included within this interval

  • although the p-value (p = 0.051) is less than 0.1, it is greater than 0.05, so there is only weak evidence against the null hypothesis

The null hypothesis cannot be confidently rejected.

The second set of null hypotheses that we test is that all the differences in intercept for the non-baseline groups are zero.

  • Among the intercept offsets (rows 3 to 8 of Table 5), only the 95% confidence interval for Bbiweekly (-17.29, -0.45) does not include zero. For every other group, it is plausible that the intercept is the same as that of the annual group.

  • Likewise, the p-value for Bbiweekly (p = 0.039) is the only one among the intercept offsets below 0.05. The remaining p-values are too large to reject the null hypothesis.

  • Two of the age x frequency terms, which are differences in slope rather than in intercept, also have intervals that exclude zero: row 9 (0.032, 0.39; p = 0.021) and row 13 (0.0019, 0.36; p = 0.048). The term labels are truncated in Table 5; the conclusions attribute these to the biweekly and quarterly groups.

3.2.1. Model 2 Methods

The components of my multiple linear regression model are the following:

  • Outcome variable y1 = purchase amount in USD

  • Numerical explanatory variable x1 = previous purchases

  • Categorical explanatory variable x2 = age group

We want to know if the relationship between amount of previous purchases and purchase_amount_USD is conditional on one’s age group.

Table 6. Regression table for interaction model of amount purchased in USD as a function of age group and previous purchases.

# A tibble: 20 × 7
   term                estimate std.error statistic   p.value conf.low conf.high
   <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 (Intercept)          60.3       2.29     26.4    4.91e-141   55.9      64.8  
 2 previous_purchases   -0.0253    0.0820   -0.309  7.57e-  1   -0.186     0.135
 3 age_group25-29       -1.35      3.33     -0.405  6.85e-  1   -7.87      5.17 
 4 age_group30-34        0.273     3.35      0.0817 9.35e-  1   -6.29      6.83 
 5 age_group35-39       -0.746     3.36     -0.222  8.25e-  1   -7.34      5.85 
 6 age_group40-44       -2.10      3.34     -0.630  5.29e-  1   -8.65      4.44 
 7 age_group45-49       -4.69      3.40     -1.38   1.68e-  1  -11.4       1.98 
 8 age_group50-54        6.35      3.39      1.87   6.12e-  2   -0.297    13.0  
 9 age_group55-59       -1.58      3.37     -0.468  6.40e-  1   -8.18      5.03 
10 age_group60-64       -1.26      3.42     -0.370  7.12e-  1   -7.96      5.44 
11 age_group65-70       -3.19      3.19     -1.00   3.17e-  1   -9.46      3.07 
12 previous_purchases…   0.107     0.117     0.915  3.60e-  1   -0.122     0.336
13 previous_purchases…   0.0259    0.118     0.221  8.25e-  1   -0.205     0.256
14 previous_purchases…   0.0228    0.118     0.193  8.47e-  1   -0.208     0.254
15 previous_purchases…   0.0253    0.116     0.217  8.28e-  1   -0.203     0.254
16 previous_purchases…   0.0880    0.123     0.716  4.74e-  1   -0.153     0.329
17 previous_purchases…  -0.113     0.117    -0.964  3.35e-  1   -0.342     0.116
18 previous_purchases…   0.0482    0.116     0.417  6.77e-  1   -0.179     0.275
19 previous_purchases…   0.0399    0.120     0.332  7.40e-  1   -0.195     0.275
20 previous_purchases…   0.108     0.111     0.976  3.29e-  1   -0.109     0.325

3.2.2. Model 2 Results

  • First, since 18-24 comes before the other age groups in sorted order, the 18-24 age group is the "baseline for comparison" group. Thus, the intercept is the intercept for the 18-24 group.

  • This holds similarly for previous_purchases. Its estimate is the slope for previous_purchases for only the 18-24 group. Thus, the regression line for the 18-24 group has an intercept of 60.34 and a slope for previous_purchases of -0.025.

  • The values for the other age groups are not their intercepts, but rather the offset in intercept for that specific age group relative to the 18-24 age group. The intercept for another age group is the baseline intercept + the estimate for that age group.

  • Similarly, the age group x previous_purchases estimates are not the slopes for the other age groups, but rather the offsets in slope for those age groups. Therefore, the slope for another age group is the previous_purchases estimate + that group's age group x previous_purchases estimate.
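The rules above can be applied to every age group at once. The sketch below copies the rounded estimates from Table 6. The interaction labels are truncated in the table, so it assumes rows 12 to 20 follow the same age group order as rows 3 to 11.

```r
age_group <- c("18-24", "25-29", "30-34", "35-39", "40-44",
               "45-49", "50-54", "55-59", "60-64", "65-70")

b0     <- 60.3      # intercept for the baseline (18-24) group
b_prev <- -0.0253   # slope for previous_purchases in the baseline group

# Offsets from Table 6; the baseline group has offset 0 by definition.
intercept_offsets <- c(0, -1.35, 0.273, -0.746, -2.10,
                       -4.69, 6.35, -1.58, -1.26, -3.19)
slope_offsets     <- c(0, 0.107, 0.0259, 0.0228, 0.0253,
                       0.0880, -0.113, 0.0482, 0.0399, 0.108)

lines_by_group <- data.frame(
  age_group = age_group,
  intercept = b0 + intercept_offsets,
  slope     = b_prev + slope_offsets
)
print(lines_by_group)
```

For example, under these rounded values the 25-29 group has a slope of about 0.082 and the 50-54 group a slope of about -0.138. None of the slope offsets is distinguishable from zero in Table 6, so these differences should be read with caution.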

3.2.3. Model 2 Interpretation

The first null hypothesis is that there is no relationship between previous purchases and amount purchased in USD at the population level (the population slope is zero).

There appears to be a possible negative relationship between previous purchases and amount purchased in USD for consumers in the baseline group (bpreviouspurchases = -0.025). However, this does not appear to be a meaningful relationship since in the table we see

  • the 95% confidence interval for the population slope Bpreviouspurchases is (-0.19, 0.14), and zero is included within this interval

  • the p-value (p = 0.76) is much greater than 0.1, so there is no evidence against the null hypothesis

The null hypothesis cannot be rejected.

The second set of null hypotheses that we test is that all the differences in intercept for the non-baseline groups are zero.

  • All of the 95% confidence intervals contain zero, therefore it is plausible that all intercepts are the same.

  • All of the p-values are too large to reject the null hypothesis.

3.3.1. Model 3 Methods

The components of my multiple linear regression model are the following:

  • Outcome variable y1 = review rating

  • Numerical explanatory variable x1 = age

  • Categorical explanatory variable x2 = discount applied

We want to know if the relationship between age and review rating is conditional on whether or not there was a discount applied to the purchase.

Table 7. Regression table for interaction model of review rating as a function of age and whether or not there was a discount applied.

# A tibble: 4 × 7
  term                   estimate std.error statistic p.value conf.low conf.high
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)             3.86      0.0468      82.6   0       3.77e+0  3.95    
2 age                    -0.00237   0.00100     -2.36  0.0185 -4.34e-3 -0.000397
3 discount_appliedYes    -0.153     0.0709      -2.15  0.0313 -2.92e-1 -0.0137  
4 age:discount_appliedY…  0.00306   0.00152      2.01  0.0443  7.74e-5  0.00604 

3.3.2. Model 3 Results

  • Since “no” comes first alphabetically, people who did not have a discount applied are the “baseline comparison group”. Therefore, the intercept (b0 = 3.86) represents the intercept for only the group who did not receive a discount.
  • The estimate for the slope for age (bage = -0.0024) is the associated change in review rating for every increase of one year in age, again for only the group who did not receive a discount. For every increase of one year in age, there is an associated decrease of 0.0024 in review rating.

  • The estimate for the group that did get a discount applied (bdiscountappliedyes = -0.15) is the offset in intercept relative to the group who did not get a discount (baseline).

  • The age x discount applied estimate (0.0031) is the offset in slope for age for the discount group relative to the baseline group.
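The two fitted lines implied by Table 7 can be written out by adding the offsets to the baseline values. The sketch below uses the rounded estimates as printed. The crossing age is a derived quantity that is not reported in the original analysis and is included only to show how the two lines relate.

```r
# Estimates copied from Table 7, rounded as printed.
b0        <- 3.86      # intercept, no-discount (baseline) group
b_age     <- -0.00237  # slope for age, no-discount group
offset_b0 <- -0.153    # discount_appliedYes: intercept offset
offset_b1 <- 0.00306   # age:discount_appliedYes: slope offset

lines_by_discount <- data.frame(
  discount_applied = c("No", "Yes"),
  intercept = c(b0, b0 + offset_b0),
  slope     = c(b_age, b_age + offset_b1)
)
print(lines_by_discount)

# Age at which the two fitted lines give the same review rating:
# offset_b0 + offset_b1 * age = 0.
crossing_age <- -offset_b0 / offset_b1
crossing_age
```

With these rounded values the discount group has an intercept of 3.707 and a slope of about 0.0007, and the two lines cross at about age 50. Below that age the fitted rating is higher without a discount, and above it the fitted rating is higher with a discount.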

3.3.3. Model 3 Interpretation

The first null hypothesis is that there is no relationship between age and review rating at the population level (the population slope is zero).

There appears to be a negative relationship between age and review rating for consumers who did not receive a discount (bage = -0.0024). This appears to be a meaningful relationship since in the table we see

  • the 95% confidence interval for the population slope Bage is (-0.00434, -0.00040), and zero is not included within this interval

  • the p-value (p = 0.019) is less than 0.05, indicating moderate evidence against the null hypothesis

Therefore, the relationship does indeed appear to be negative.
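The interval and p-value used above can be approximately reproduced from the estimate, standard error, and test statistic in Table 7. The sketch below uses a normal approximation, which is an assumption on my part: the table was presumably computed with a t distribution and unrounded values, so the results agree only to within rounding.

```r
# Row 2 of Table 7 (age), rounded as printed.
estimate  <- -0.00237
std_error <- 0.00100
statistic <- -2.36

# 95% interval: estimate +/- critical value * standard error.
z  <- qnorm(0.975)
ci <- c(lower = estimate - z * std_error,
        upper = estimate + z * std_error)

# Two-sided p-value for the test statistic.
p_value <- 2 * pnorm(-abs(statistic))

round(ci, 5)
round(p_value, 4)
```

Both ends of the interval are negative, which is the same as saying that zero is not a plausible value for the slope at the 95% level.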

The second null hypothesis that we are testing is that the difference in intercept for the non-baseline group is zero.

  • The 95% confidence interval for the group who received a discount is (-0.29, -0.014). This interval does not contain zero, therefore it is not plausible that the intercept is the same as for the baseline group.

  • The p-value is 0.031, which is less than 0.05, indicating moderate evidence against the null hypothesis.

Finally, we address the third null hypothesis: that the relationship between age and review rating does not depend on whether a discount was applied (the population interaction term is zero).

  • The 95% confidence interval for the interaction is (0.000077, 0.0060). This interval does not contain zero, therefore a zero interaction is not plausible.

  • The p-value of 0.044 is less than 0.05, indicating that there is moderate evidence against the null hypothesis.

4. Conclusions

We found that (1) there was no significant difference in the amount purchased in USD based on age for people with most frequencies of purchases, (2) there was no significant difference in the amount purchased based on number of previous purchases for people in different age groups, and (3) there is moderate evidence for a difference in the relationship between age and review rating for people who got a discount versus those who did not. For Model 3, we found that when there was no discount applied, as age increased, review rating decreased. When there was a discount applied, as age increased, review rating increased slightly. For Model 1, we found moderate evidence that the relationship between age and purchase amount differed from the annual baseline for those who purchased biweekly and quarterly, with purchase amount increasing with age in those groups.

# A tibble: 12 × 5
   term                          estimate std.error statistic  p.value
   <chr>                            <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                   58.4        2.86     20.4    3.32e-88
 2 age                            0.0308     0.0614    0.502  6.16e- 1
 3 payment_methodCash             0.803      3.97      0.202  8.40e- 1
 4 payment_methodCredit Card      0.386      4.00      0.0965 9.23e- 1
 5 payment_methodDebit Card       0.789      4.08      0.193  8.47e- 1
 6 payment_methodPayPal           6.66       4.07      1.64   1.02e- 1
 7 payment_methodVenmo            4.61       4.06      1.13   2.57e- 1
 8 age:payment_methodCash        -0.0185     0.0851   -0.218  8.28e- 1
 9 age:payment_methodCredit Card -0.00106    0.0852   -0.0125 9.90e- 1
10 age:payment_methodDebit Card   0.00925    0.0876    0.106  9.16e- 1
11 age:payment_methodPayPal      -0.162      0.0875   -1.85   6.44e- 2
12 age:payment_methodVenmo       -0.123      0.0876   -1.40   1.61e- 1